Acquisition of Bilingual MT Lexicons from OCRed Dictionaries

Authors

  • Burcu Karagol-Ayan
  • David Doermann
  • Bonnie J. Dorr
Abstract

This paper describes an approach to analyzing the lexical structure of OCRed bilingual dictionaries to construct resources suited for machine translation of low-density languages, where online resources are limited. A rule-based, an HMM-based, and a post-processed HMM-based method are used for rapid construction of MT lexicons based on systematic structural clues provided in the original dictionary. We evaluate the effectiveness of our techniques, concluding that: (1) the rule-based method performs better with dictionaries where the font is not an important distinguishing feature for determining information types; (2) the post-processed stochastic method improves the results of the stochastic method for phrasal entries; and (3) our resulting bilingual lexicons are comprehensive enough to provide the basis for reasonable translation results when compared to human translations.

Similar resources

Stochastic Language Models for Automatic Acquisition of Lexicons from Printed Bilingual Dictionaries

Electronic bilingual lexicons are crucial for machine translation, cross-lingual information retrieval and speech recognition. For low-density languages, however, the availability of electronic bilingual lexicons is questionable. One solution is to acquire electronic lexicons from printed bilingual dictionaries. While manual data entry is a possibility, automatic acquisition of lexicons from sc...

Cross-Lingual Bootstrapping of Semantic Lexicons: The Case of FrameNet

This paper considers the problem of unsupervised semantic lexicon acquisition. We introduce a fully automatic approach which exploits parallel corpora, relies on shallow text properties, and is relatively inexpensive. Given the English FrameNet lexicon, our method exploits word alignments to generate frame candidate lists for new languages, which are subsequently pruned automatically using a sm...

Domain Tuning of Bilingual Lexicons for MT

Our overall objective is to translate a domain-specific document in a foreign language (in this case, Chinese) to English. Using automatically induced domain-specific, comparable documents and language-independent clustering, we apply domain-tuning techniques to a bilingual lexicon for downstream translation of the input document to English. We will describe our domain-tuning technique and demo...

Automatically Creating Bilingual Lexicons for Machine Translation from Bilingual Text

A method is presented for automatically augmenting the bilingual lexicon of an existing Machine Translation system, by extracting bilingual entries from aligned bilingual text. The proposed method only relies on the resources already available in the MT system itself. It is based on the use of bilingual lexical templates to match the terminal symbols in the parses of the aligned sentences. ...

A Comparative Study of Knowledge-Based Approaches for Cross-Language Information Retrieval

Cross-language retrieval systems seek to use queries in one natural language to guide the retrieval of documents that might be written in another. Acquisition and representation of translation knowledge plays a central role in this process. This paper explores the utility of two sources of manually encoded translation knowledge, bilingual dictionaries and translation lexicons, for cross-languag...

Publication date: 2003